Genomics in the Cloud: GATK, Spark, and Docker


Authors: Brian D. O'Connor, Geraldine Van der Auwera
Publisher: O'Reilly
Published: 2020-04-21
ISBN-13: 9781491975190
ISBN-10: 1491975199
Binding: Paperback
Pages: 475





Description


Data in the genomics field is booming. In just a few years, organizations such as the National Institutes of Health (NIH) will host 50+ petabytes—or 52.4 million gigabytes—of genomic data, and they’re turning to cloud infrastructure to make that data available to the research community. How do you adapt analysis tools and protocols to access and analyze that data in the cloud?
With this practical book, researchers will learn how to work with genomics algorithms using open source tools including the Genome Analysis Toolkit (GATK), Docker, WDL, and Terra. Brian O’Connor of the UC Santa Cruz Genomics Institute and Geraldine Van der Auwera, longtime custodian of the GATK user community, guide you through the process. You’ll learn by working with real data and genomics algorithms from the field.
This book takes you through:

Essential genomics and computing technology background
Basic cloud computing operations
Getting started with GATK
Three major GATK best practices for variant discovery pipelines
Automating analysis with scripted workflows using WDL and Cromwell
Scaling up workflow execution in the cloud, including parallelization and cost optimization
Interactive analysis in the cloud using Jupyter notebooks
Secure collaboration and computational reproducibility using Terra


About the Authors


Dr. Geraldine A. Van der Auwera is the Director of Outreach and Communication for the Data Sciences Platform (DSP) at the Broad Institute of MIT and Harvard. As part of her outreach role, she serves as an educator and advocate for researchers who use DSP software and services including GATK, the Broad's industry-leading toolkit for variant discovery analysis; the Cromwell/WDL workflow management system; and Terra.bio, a cloud-based analysis platform that integrates computational resources, methods repository and data management in a user-friendly environment. Van der Auwera was originally trained as a microbiologist, earning her Ph.D. in Biological Engineering from the Université catholique de Louvain (UCL) in Belgium in 2007, then surviving a 4-year postdoctoral stint at Harvard Medical School. She joined the Broad Institute in 2012 to become Benevolent Dictator For Life of the GATK user community, leaving behind the bench and pipette work forever.
Dr. Brian O’Connor is the Technical Director of the UCSC Genomics Institute Analysis Core. There he focuses on the development and deployment of large-scale, cloud-based systems for analyzing genomics data. This includes the Toil workflow execution platform, which is designed to run genomic pipelines on a wide range of cloud environments including AWS, Azure, Google and OpenStack, and ADAM, a distributed genomics platform developed in collaboration with UC Berkeley. He is also the co-chair of the Containers and Workflows task team of the Global Alliance for Genomics and Health (GA4GH) where he works on tool and workflow container standards. Brian recently joined UCSC from the Ontario Institute for Cancer Research (OICR) where his previous projects included leading the technical implementation of cloud-based analysis systems for the PanCancer Analysis of Whole Genomes (PCAWG) effort, the creation of the Dockstore project (http://dockstore.org), and the development of the International Cancer Genome Consortium’s Data Portal (http://dcc.icgc.org).




Related Books

AWS Certified Solutions Architect – Associate Guide: The ultimate exam guide to AWS Solutions Architect certification

By Gabriel Ramirez, Stuart Scott

2020-04-21

Kubernetes 快速入門 (Kubernetes Quick Start)

By Nigel Poulton

2020-04-21

AWS Certified Developer - Associate (DVA-C01) Cert Guide

By Sluga Marko

2020-04-21